Project 5 - Prosper loan data EDA by Brian Chase

Introduction

The data I explore below is Loan Data from Prosper last updated 03/11/2014. It contains 113,937 loans with 81 variables on each loan.

I picked this data set since I am interested in financial data sets and the challenge of finding how to get a good return or—in this case—finding the highest performing loans that do not default.

It turned into a supervised machine learning problem since it hinges on predicting what variables are important in predicting delinquencies.

In the end, I focused on the other side of the loan market. What influences Borrower APR and what are the differences in the different scoring systems that I discovered.

I focused on variables that would describe the borrower at the time of applying for the loan. For example, what is the credit score or how many delinquencies did the borrower have—in contrast to data collected after the loan like the number of investors.

I neglected variable LoanOriginalAmount originally and if I would return to the analysis, I’d explore that variable in more detail.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","AA","A","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","AA","A","B",..: 1 3 1 3 6 4 7 5 2 2 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

The data includes 81 variables and 113937 observations.

Univariate plots section

##          AA     A     B     C     D     E    HR    NC 
## 84984  3509  3315  4389  5649  5153  3289  3508   141

A critical variable for a loan is the credit grade. Here we can see that a large amount of the data is unlabled. That is because they used a different variable after July 2009. Let’s look at the ‘credit grade’ equivalent after July 2009 which is ProsperRating.

##          AA     A     B     C     D     E    HR 
## 29084  5372 14551 15581 18345 14274  9795  6935

The histogram for prosper rating also has missing values. My hypothesis is that it is the data points that were accounted by CreditGrade. Next I’ll remove all the grades/ratings that have the value “” (not labeled)

Here we see for both time periods the grade/rating with the highest frequency is the ‘C’ rating.

It also looks like the ratio to higher grade/rating is skewed toward higher quality loans, but it is hard to tell since the ProsperRating variable has a lower ratio of AA rated loan notes. They may have also changed the requirements for the different classifications when they changed the variable name. I’d have to do more research to check if they can be compared.

## Source: local data frame [8 x 3]
## 
##   ProsperRating..Alpha. no_rows ratio_pct
##                  (fctr)   (int)     (dbl)
## 1                         29084 34.275747
## 2                    AA    5372  6.330949
## 3                     A   14551 17.148480
## 4                     B   15581 18.362344
## 5                     C   18345 21.619742
## 6                     D   14274 16.822033
## 7                     E    9795 11.543493
## 8                    HR    6935  8.172958

Here the percentage of notes above C is 41.84

## Source: local data frame [8 x 3]
## 
##   CreditGrade no_rows  ratio_pct
##        (fctr)   (int)      (dbl)
## 1          AA    3509 12.1196422
## 2           A    3315 11.4495907
## 3           B    4389 15.1590509
## 4           C    5649 19.5109315
## 5           D    5153 17.7978102
## 6           E    3289 11.3597900
## 7          HR    3508 12.1161883
## 8          NC     141  0.4869962

And for credit grade, the notes above C is 38.73. So my visual intuition was correct—but the difference is not much.

## 
##    12    36    60 
##  1614 87778 24545
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00

There are three values for Term. 12 months, 36 and 60 months

## Source: local data frame [12 x 3]
## 
##                LoanStatus no_rows   ratio_pct
##                    (fctr)   (int)       (dbl)
## 1               Cancelled       5  0.00438839
## 2              Chargedoff   11992 10.52511476
## 3               Completed   38074 33.41671274
## 4                 Current   56576 49.65551138
## 5               Defaulted    5018  4.40418828
## 6  FinalPaymentInProgress     205  0.17992399
## 7    Past Due (>120 days)      16  0.01404285
## 8    Past Due (1-15 days)     806  0.70740848
## 9   Past Due (16-30 days)     265  0.23258467
## 10  Past Due (31-60 days)     363  0.31859712
## 11  Past Due (61-90 days)     313  0.27471322
## 12 Past Due (91-120 days)     304  0.26681412
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

In the LoanStatus variable, we can see that most loans are current and the majority of the ended loans are completed. But there is an interesting level of chargedoff and defaulted loans. When you add chargedoff and Defaulted variable counts together, it is shy of half the completed variable count. This seems to be a high number for unsecured loans. Does grade matter when it comes to what is completed and what is chargedoff/defaulted?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

## Source: local data frame [21 x 3]
## 
##    ListingCategory..numeric. no_rows ratio_pct
##                        (int)   (int)     (dbl)
## 1                          1   58308 51.175650
## 2                          0   16965 14.889808
## 3                          7   10494  9.210353
## 4                          2    7433  6.523781
## 5                          3    7189  6.309627
## 6                          6    2572  2.257388
## 7                          4    2395  2.102039
## 8                         13    1996  1.751845
## 9                         15    1522  1.335826
## 10                        18     885  0.776745
## ..                       ...     ...       ...

From variable explaination spreadsheet: “The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans”

It looks like debt considation is by far and away the largest category. After that is NA and Other. Then there is Home improvement loans and business loans.

##                                                        Accountant/CPA 
##                               3588                               3233 
##           Administrative Assistant                            Analyst 
##                               3688                               3602 
##                          Architect                           Attorney 
##                                213                               1046 
##                          Biologist                         Bus Driver 
##                                125                                316 
##                         Car Dealer                            Chemist 
##                                180                                145 
##                      Civil Service                             Clergy 
##                               1457                                196 
##                           Clerical                Computer Programmer 
##                               3164                               4478 
##                       Construction                            Dentist 
##                               1790                                 68 
##                             Doctor                Engineer - Chemical 
##                                494                                225 
##              Engineer - Electrical              Engineer - Mechanical 
##                               1125                               1406 
##                          Executive                            Fireman 
##                               4311                                422 
##                   Flight Attendant                       Food Service 
##                                123                               1123 
##            Food Service Management                          Homemaker 
##                               1239                                120 
##                           Investor                              Judge 
##                                214                                 22 
##                            Laborer                        Landscaping 
##                               1595                                236 
##                 Medical Technician                  Military Enlisted 
##                               1117                               1272 
##                   Military Officer                        Nurse (LPN) 
##                                346                                492 
##                         Nurse (RN)                       Nurse's Aide 
##                               2489                                491 
##                              Other                         Pharmacist 
##                              28617                                257 
##         Pilot - Private/Commercial  Police Officer/Correction Officer 
##                                199                               1578 
##                     Postal Service                          Principal 
##                                627                                312 
##                       Professional                          Professor 
##                              13628                                557 
##                       Psychologist                            Realtor 
##                                145                                543 
##                          Religious                  Retail Management 
##                                124                               2602 
##                 Sales - Commission                     Sales - Retail 
##                               3446                               2797 
##                          Scientist                      Skilled Labor 
##                                372                               2746 
##                      Social Worker         Student - College Freshman 
##                                741                                 41 
## Student - College Graduate Student           Student - College Junior 
##                                245                                112 
##           Student - College Senior        Student - College Sophomore 
##                                188                                 69 
##        Student - Community College         Student - Technical School 
##                                 28                                 16 
##                            Teacher                     Teacher's Aide 
##                               3759                                276 
##              Tradesman - Carpenter            Tradesman - Electrician 
##                                120                                477 
##               Tradesman - Mechanic                Tradesman - Plumber 
##                                951                                102 
##                       Truck Driver                    Waiter/Waitress 
##                               1675                                436
## Source: local data frame [68 x 2]
## 
##                  Occupation no_rows
##                      (fctr)   (int)
## 1                     Other   28617
## 2              Professional   13628
## 3       Computer Programmer    4478
## 4                 Executive    4311
## 5                   Teacher    3759
## 6  Administrative Assistant    3688
## 7                   Analyst    3602
## 8                              3588
## 9        Sales - Commission    3446
## 10           Accountant/CPA    3233
## ..                      ...     ...

Excluding Other, the top occupations applying for loans is Professional—which could encompass some of other occupations. Then the next is Computer programmer and exective. It’d be interesting to see if professions were skewed toward different types of loans.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00    9.00    9.26   12.00   54.00    7604

The open credit lines peaks around 7 credit lines. The curve almost looks like a normal distribuition with a long tail.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   1.435   2.000 105.000     697

For inquires in the last 6 months, most fall with 0 to 3. But someone did have 105 inquires!

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3121    8549   17600   19520 1436000    7604

The revolving credit balance looks like a power law distubution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Most loan applicants have a debt-to-incomeRatio from 0.15 to 0.25. The peak is around 0.25.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

BorrowerAPR median is 21% (rounded) with a mean of 21.9%.

LoanOriginalAmount almost looks like a normal distribution when transformed into a logrithmic scale—there are spikes where there are popular values.

Univariate Analysis

What is the structure of the dataset?

There are 81 variables and 113937 observations.

(worst) —————-> (best)

CreditRating: HR,E,D,C,B,A,AA

ProsperRating: HR,E,D,C,B,A,AA

Other observations: * 15% of notes are in the defaulted or chargedoff loan status * 51% of loans are listed in the Debt Consildation category. * The occupation listed on loans is usually “professional” * Most loan applicants have a debt-to-incomeRatio from 0.15 to 0.25. The peak is around 0.25.

In terms of credit score, the middle rating/grade “C” is the most frequent

What is/are the main features(s) of interest in the dataset?

The main features are the CreditGrade/ProsperRating/ProsperScore (which one to use?!), Loan Status, and BorrowerAPR.

Lending club is a peer-to-peer lending marketplace where loan applicants are trying to get the lowest APR and the investors are trying to make the highest return—which comes down to figuring out the rate that unsecured loans will written off/defaulted.

Bivariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    4.00    6.00    5.95    8.00   11.00   29084

The chart above is faceted based on the ProsperRating..Alpha variable.

ProsperScore also measures the metric of risk (higher = less risk). It is interesting that the ProsperScores are varied for each ProsperRating. I was expecting that it would be a copy of ProsperRating. Why is this different?

There are only ProsperScores during the time period after CreditGrade…

##                           PR... PrspS BrAPR Term  LC... OpnCL InL6M DlL7Y
## ProsperRating..numeric.    1.00                                          
## ProsperScore               0.71  1.00                                    
## BorrowerAPR               -0.96 -0.67  1.00                              
## Term                       0.08  0.03 -0.01  1.00                        
## ListingCategory..numeric. -0.09 -0.01  0.13  0.00  1.00                  
## OpenCreditLines            0.11  0.00 -0.10  0.10 -0.11  1.00            
## InquiriesLast6Months      -0.26 -0.30  0.15 -0.11 -0.07  0.04  1.00      
## DelinquenciesLast7Years   -0.15 -0.10  0.16 -0.04  0.02 -0.18  0.09  1.00
## RevolvingCreditBalance     0.06  0.04 -0.06  0.03 -0.05  0.34 -0.01 -0.13
## DebtToIncomeRatio         -0.14 -0.15  0.06 -0.01 -0.04  0.09  0.02 -0.04
## LoanOriginalAmount         0.43  0.27 -0.32  0.34 -0.13  0.22 -0.10 -0.14
##                           RvlCB DbTIR
## ProsperRating..numeric.              
## ProsperScore                         
## BorrowerAPR                          
## Term                                 
## ListingCategory..numeric.            
## OpenCreditLines                      
## InquiriesLast6Months                 
## DelinquenciesLast7Years              
## RevolvingCreditBalance     1.00      
## DebtToIncomeRatio          0.04  1.00
## LoanOriginalAmount         0.19  0.01
## [1]  1.00

Going to look for the top 15 correlations for ProsperScore, ProsperRating..numeric., and Borrowing APR and see if there are any variables that I’ve been missing in a list.

##    Var1                         Var2       Freq
## 10    A                 ProsperScore  1.0000000
## 9     A      ProsperRating..numeric.  0.6713091
## 3     A                  BorrowerAPR -0.6624899
## 5     A                  LenderYield -0.6375801
## 4     A                 BorrowerRate -0.6371337
## 7     A                EstimatedLoss -0.6256541
## 6     A      EstimatedEffectiveYield -0.5605409
## 20    A         InquiriesLast6Months -0.4424450
## 13    A        CreditScoreRangeLower  0.3851986
## 14    A        CreditScoreRangeUpper  0.3851986
## 61    A                    Investors  0.3583201
## 32    A      TradesOpenedLast6Months -0.3267562
## 21    A               TotalInquiries -0.3203964
## 45    A   LoanMonthsSinceOrigination  0.3118994
## 50    A LP_CustomerPrincipalPayments  0.2996834

InquiriesLast6Months, CreditScoreRange, and investors variables look to be variables that most likely is not derived from other variables that correlate well with ProsperScore.

##    Var1                         Var2       Freq
## 9     A      ProsperRating..numeric.  1.0000000
## 3     A                  BorrowerAPR -0.9428773
## 5     A                  LenderYield -0.9302680
## 4     A                 BorrowerRate -0.9301399
## 7     A                EstimatedLoss -0.8947862
## 10    A                 ProsperScore  0.6713091
## 13    A        CreditScoreRangeLower  0.6547768
## 14    A        CreditScoreRangeUpper  0.6547768
## 47    A           LoanOriginalAmount  0.4692016
## 6     A      EstimatedEffectiveYield -0.4311090
## 61    A                    Investors  0.4227322
## 54    A        LP_GrossPrincipalLoss  0.3990820
## 55    A          LP_NetPrincipalLoss  0.3896026
## 52    A               LP_ServiceFees -0.3579420
## 48    A           MonthlyLoanPayment  0.3158218
## 29    A      AvailableBankcardCredit  0.2898297
## 50    A LP_CustomerPrincipalPayments  0.2890016
## 21    A               TotalInquiries -0.2868433
## 20    A         InquiriesLast6Months -0.2836018
## 28    A          BankcardUtilization -0.2743389

I could see that BorrowerAPR and ProsperRating..numeric. correlated highly from the previous tables above, so I plotted just the ProsperRating correlations and sorted them. CreditScoreRange, Investors (I am guessing this means the size of the loan since most people put in $25 increments to diversify—not going to count this on second thought), and Inquiries (both Total and the last 6 months).

It looks like there are a few outliers that extend far away from the other points (~100 inquiries). It is interesting that there are a few vertical lines. I wonder why that is. I want to see if I can get a closer look at the data excluding the outliers.

People have fewer inquiries to the lower end of the BorrowerAPR range (better credit), but it is rather uniform—with larger peaks in short bands at the very end. Also there is a strange gap around 0.36.

There looks to be less points between x = 0.36 to 0.375.

The median BorrowerAPR for Completed is the lowest of the LoanStatus categories. They are as I would expect. The PastDue(>120 days) category is interesting in that there is a larger jump in median borrower APR and a tighter range favoring higher APR. Could this be because the other ‘less past due’ notes jump to Defaulted or Chargedoff through some action on the borrower and higher borrowerAPR borrowers tend to negliect that action?

Bivariate Analysis

How did the feature(s) of interest vary with other features in the dataset?

ProsperRating, BorrowerAPR, LenderYield, BorrowerRate, EstimatedLoss are highly correlated (greater than the absolute value of 0.85).

ProsperScore, CreditScoreRangeLower/Upper are slightly less correlated (greater than the absolute value of 0.65).

This leads one to question which variables are derived from other variables. It makes sense that they use the ProsperRating to determine the BorrowerAPR to a large extent and the Lender’s yield obvious is critically tied to what the borrower pays in interest rate—unless the Prosper the platform company charged wildly variying rates based on a different variable.

In comparing LoanStatus to the BorrowerAPR, there is lower median APR for completed and current loan notes when compared to the ones that are chargedoff/defaulted/past due.

Multivariate Plots Section

Delinquencies are less for smaller BorrowerAPR as expected. However, I was expecting the CreditGrade to match up with the borrower APR. There is a trend for the better CreditGrade notes being lower BorrowerAPR, but there is a great spread and overlap.

This looks more like I would expect! The ProsperRatings are tightly coupled in bins organized by BorrowerAPR. The Delinquencies for the last 7 years seem to matter most for AA and A ratings. There is some overlap of ProsperRating in the different BorrowerAPR bins, but it is much less than for Credit Grade.The bins are a little less defined for the high APRs.

There is an obvious trend toward darker, lower ProsperScore for higher BorrowerAPR. There looks to be some overlap of Scores. Making the dots more transparent, we can saee the tight distribution of deliquencies in the last 7 years for Borrower APR.

This is clear from this chart that the CreditGrade is dominated by CreditScoreRangeUpper (and potentially CreditScoreRangeLower). There are clear horizontal colored bands correlating to CreditScore bands. The HR and E CreditGrades bands overlap a little.

While the BorrowerAPR is trending how I would have guessed — AA for lower BorrowerAPR and E is the highest — there is significant overlap between CreditGrades and BorrowerAPR. How would they decide what BorrowerAPR to give for someone applying for a 17% loan?

Here we can see the same bands of ProsperRating binned by BorrowerAPR. Now we can see how that compares to CreditScore—where there is significant overlap.

Same chart colored with ProsperScore. There is a trend of lower prosperScore for lower BorrowerAPR and lower CreditScore.

Final Plots and Summary

Plot One

Description One

Loans that are Chagedoff, Defaulted or are Past Due have higher median BorrowerAPR. Completed has the lowest median and the notes that are >120 days past due have a slight uptick for median.

Plot Two

Description Two

CreditGrade colored data points falls into well-defined buckets of CreditScore with AA being the highest CreditScoreRangeUpper interval. In contrast, the variable BorrowerAPR there is a trend of lower BorrowerAPR for higher grade loan notes similar to Credit Score, but there are overlaps between the CreditGrade of the loan notes.

Plot Three

Description Three

ProsperRating colored data points falls into well defined buckets of Borrower’s APR with the highest rated notes AA having the lowest Borrower APR all the way down to HR rated loans notes having the highest Borrower APR. In contrast, the CreditScoreRangeUpper variable has a trend of lower BorrowerAPR for higher rated notes similar to BorrowerAPR, but there are overlaps between the ProsperRating sections.

Reflection

I ran into difficulties getting into the mindset to explore the data. My initial thought was that I was going to find some interesting patterns in the data that specifically made a borrower have a good or bad APR and figure out what makes a loan a good canidate to be completed or default. While it was a good starting point, I started to realize that I needed to keep a more open mind, because I was litterially feeling stressed when something profound did not jump out at me. Looking at expected relationships is fine and can be enlightening too!

I had more success when I started allowing myself to browse the data for expected relationships. I felt success when I realized that CreditGrade depended on the CreditScore and it was so plainly visible in the charts when compared to a chart of the same with ProsperScore. That was satisfying.

The analysis could be further enriched by exploring the ProsperScore and how that is created. Also I would like to create a classifier to figure out what variables most impact whether or not a note is a “good” loan note. I tried to create one in R where I binned good outcomes/most likely good outcomes and bad outcomes/most likely bad outcomes (half the past due notes split in each category based on how late they were), but my model crashed Rstudio and I did not follow through on debugging the reason why. The weights of that model could open new paths to explore.